Old 11-17-2010, 07:19 AM   #1
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Rep: Reputation: 17
[bash] find | grep optimization


My recent optimization question seems to have generated some interest.

So I thought I'd post my findings on another case for discussion. (I hope this is appropriate on this forum.)

Scenario
A folder contains about 280,000 HTML files, all residing in monthly subfolders and following the same naming convention. IDs are numeric and of varying length.
Code:
data/<yyyy>/<mm>/event_<id>.html
I need to build a file containing just the files' IDs, excluding a particular monthly folder.

First attempt time: 4.6s
Code:
find data/ -path data/1975/01/ -prune -o \( -type f -printf '%f\n' \) | \
grep -E -o -e '[0-9]*'

I then went back to basics, adding the options above to the command one at a time to see where the time was spent.

find data -type f only takes an amazing 0.22s.
find data/ -path data/1975/01/ -prune -type f: 0.45s
find data/ -path data/1975/01/ -prune -o \( -type f -printf '%f\n' \): 0.6s
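(For reference, these timings come from something like:)
Code:
time find data -type f > /dev/null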

Clearly, the post-processing with grep used most of the time.
Optimization 1 time: 3.1s
Code:
find ... | grep -E -o -e '[0-9]+'

Optimization 2 time: 0.7s
Code:
find ... | cut -d '_' -f 2 | cut -d '.' -f 1
This made -printf unnecessary; dropping it brought the time down to 0.61s.
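On a sample path, the two cuts do this (relying on the paths containing no other '_' or '.'):
Code:
echo 'data/1975/02/event_12345.html' | cut -d '_' -f 2 | cut -d '.' -f 1
# prints 12345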

Optimization 3 time: 0.51s
Code:
find data -type f | grep -F -v "data/1975/01/" | cut ... | cut ...

Unfortunately, the IDs vary in length; otherwise I could have done it with a single cut -c.
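If they were fixed-width (say six digits, purely hypothetical), a single character-range cut would have done it, since the 'event_' prefix is six characters:
Code:
find ... -printf '%f\n' | cut -c 7-12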



And just so we have a question here: how come cut is so efficient?

Last edited by hashbang#!; 11-17-2010 at 07:25 AM.
 
Old 11-18-2010, 06:15 PM   #2
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 115
cut is efficient because it's not very powerful. It can't do nearly the same sort of stuff grep is capable of.

Actually, I personally wouldn't use cut, but rather bash's built-in functionality that does the same sort of thing.
Code:
while read file ; do
   id="${file%.html}"
   id="${id##*_}"
   echo "$id"
done <(find ... | grep ...)
No clue how fast it is, though I suspect it isn't that different from your final version.
 
Old 11-19-2010, 03:59 AM   #3
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by tuxdev View Post
Actually, I personally wouldn't use cut, but rather bash's built-in functionality that does the same sort of thing. [...]
(still need redirection in addition to process substitution)
Code:
while read file ; do
   id="${file%.html}"
   id="${id##*_}"
   echo "$id"
done < <(find ... | grep ...)
Quote:
Originally Posted by tuxdev View Post
No clue how fast it is, though I suspect it isn't that different from your final version.
time: 27s (compared to find | grep | cut | cut @ 0.6s)
I know the double cut looks a bit of a kludge ...
 
Old 11-19-2010, 04:40 AM   #4
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
27s or 0.27s?
 
Old 11-19-2010, 04:47 AM   #5
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by catkin View Post
27s or 0.27s?
27s! Otherwise I would have congratulated tuxdev!
 
Old 11-19-2010, 05:06 AM   #6
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! Looking for what causes such poor performance, how does this variant perform:
Code:
while read file ; do
   id="${file%.html}"
   id="${id##*_}"
   echo "$id"
done <<< "$(find ... | grep ...)"
 
Old 11-19-2010, 05:51 AM   #7
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by catkin View Post
how does this variant perform:
Code:
while read file ; do
    [...]
done <<< "$(find ... | grep ...)"
31s

I have played around with bash loops in similar situations and found them slower than letting the tools do the recursion/looping internally.
 
Old 11-19-2010, 08:42 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
What about taking the grep out and doing something like:
Code:
while IFS='_' read -r _ line
do
    echo "${line%.html}"
done < <(find data/ -path data/1975/01/ -prune -o -type f -regex '.*[0-9]+\.html' -printf "%f\n")
 
Old 11-19-2010, 10:45 AM   #9
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by grail View Post
What about taking the grep out and doing something like:
Code:
while IFS='_' read -r _ line
do
    echo "${line%.html}"
done < <(find data/ -path data/1975/01/ -prune -o -type f -regex '.*[0-9]+\.html' -printf "%f\n")
The find on its own takes 3.86s. With the while loop it takes 27s.

If you look at my original post, I took -prune and -printf out to save time.

find -regex is slower than find | egrep (see post #1, Optimization 1). Also bear in mind that grep -o already produced the desired output (the ID only).
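(Side note: GNU find's -regex matches against the whole path and defaults to emacs-flavour regular expressions; getting egrep-style syntax needs -regextype, along the lines of the following, though I wouldn't expect it to change the speed ranking.)
Code:
find data -regextype posix-extended -type f -regex '.*/event_[0-9]+\.html'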
 
Old 11-19-2010, 10:56 AM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Quote:
Originally Posted by catkin View Post
My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! Looking for what causes such poor performance, how does this variant perform:
Come on guys, it's the while loop that's slow. Except for small operations, using external programs is always going to be faster than doing it in bash.

I was surprised that the -prune wasn't faster, but I think it's because there's only one directory to prune, so it doesn't save much. And since -path takes glob patterns, it's slower than grep -F, which takes fixed strings.
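For concreteness, the two exclusion styles being compared (my reconstruction):
Code:
# glob test inside find: a pattern match runs on every path visited
find data -path 'data/1975/01*' -prune -o -type f -print

# fixed-string filter outside find: grep -F does a plain substring search
find data -type f | grep -F -v 'data/1975/01/'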
 
Old 11-19-2010, 01:04 PM   #11
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by ntubski View Post
Come on guys, it's the while loop that's slow. Except for small operations, using external programs is always going to be faster than doing it in bash.
Is it the while loop or the read? If the read is the underlying cause, is it that bash doesn't buffer stdin (that would be a strange design choice)?
 
Old 11-19-2010, 02:06 PM   #12
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Quote:
Originally Posted by catkin View Post
Is it the while loop or the read? If the read is the underlying cause that bash doesn't buffer stdin (that would be a strange design choice)?
Well, looking at read.def, I see
Code:
unbuffered_read = (nchars > 0) || (delim != '\n') || input_is_pipe;
So it appears that read actually isn't buffering. I'm not certain; the code deals with a lot of different options, so it's hard to follow.
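One way to check from the outside (assuming strace is available) is to watch the read() syscalls on pipe input:
Code:
# if read is unbuffered here, this should show a run of single-byte
# read(0, ..., 1) calls, one per character of the first line
printf 'one\ntwo\n' | strace -e trace=read bash -c 'read x'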

However, I don't think it matters either way; here is a while loop without read:
Code:
#!/bin/bash
COUNT=40000 # maybe increase this if your computer is faster than mine.

echo use bash loop
declare -i i=0
time while ((i<COUNT)) ; do ((i++)) ; done

echo ========
echo use seq
time seq $COUNT >/dev/null
For me the while loop takes around 1.3 seconds and the seq takes 0.1 seconds. bash is over 10 times as slow, and that's after I took out the echo statement.
 
Old 11-19-2010, 09:39 PM   #13
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Those are very educational figures, ntubski, and show the importance of measurement instead of blindly accepting received wisdom. For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.

On my system the output from your script was:
Code:
use bash loop

real    0m0.426s
user    0m0.405s
sys    0m0.010s
========
use seq

real    0m0.071s
user    0m0.041s
sys    0m0.004s
I modified it to also test an equivalent for ((i=0; i<COUNT; i++)) loop and found it ~10% faster than the while loop. Then I added a call to /bin/true in the loop and found the /bin/true calls took ~0.0001 s each, which is only ~100 times longer than each bare loop iteration.

Presumably there is some caching and hashing which means the first call to /bin/true could have taken significantly longer than 0.0001 s.
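The additions looked roughly like this (a sketch, slotted into ntubski's script after the existing tests):
Code:
echo ========
echo use bash for loop
time for ((i=0; i<COUNT; i++)) ; do : ; done

echo ========
echo use /bin/true in while loop
declare -i i=0
time while ((i<COUNT)) ; do ((i++)) ; /bin/true ; done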
 
Old 11-20-2010, 11:41 AM   #14
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Quote:
Originally Posted by catkin View Post
For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.
Well they are, but in the example here there are O(n) in-shell actions but O(1) external calls, so for large n the fork+exec overhead is negligible.
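(Taken to its conclusion, the per-line work can live in a single external process too; an untimed sketch, assuming the naming convention from post #1 and no stray '_' or '.' in the paths:)
Code:
find data -type f | awk -F '[_.]' '$0 !~ /^data\/1975\/01\// { print $(NF-1) }'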
 
  

